Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR
نویسندگان
چکیده
German is a morphologically rich language having a high degree of word inflections, derivations and compounding. This leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities in the large vocabulary continuous speech recognition (LVCSR) systems. One of the main challenges in the German LVCSR is the recognition of the OOV words. For this purpose, data-driven morphemes are used to provide higher lexical coverage. On the other hand, the probability estimates of a sub-lexical LM could be further improved using feature-rich LMs like maximum entropy (MaxEnt) and class-based LMs. In this work, for a sub-lexical level German LVCSR task, we investigate the use of the multiple morpheme level features as classes for building class-based LMs that are estimated using the state-of-the-art MaxEnt approach. Thus, the benefits of both the MaxEnt LMs and the traditional class-based LMs are effectively combined. Furthermore, we experiment the use of Maximum a-posteriori adaptation over the MaxEnt class-based LMs. We show consistent reductions in both the OOV recognition error rate and the word error rate (WER) on a German LVCSR task from the Quaero project, compared to the traditional class-based and the N -gram morpheme based LM.
منابع مشابه
Investigation on language modelling approaches for open vocabulary speech recognition
By definition, words that are not present in a recognition vocabulary are called out-of-vocabulary (OOV) words. Recognition of unseen or new words is an important feature that is always desired in any real-world large vocabulary continuous speech recognition (LVCSR) system. However, human languages are complex in nature due to wide varieties of morphological richness such as inflections, deriva...
متن کاملInvestigation of Maximum Entropy Hybrid Language Models for Open Vocabulary German and Polish LVCSR
For languages like German and Polish, higher numbers of word inflections lead to high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Thus, one of the main challenges in large vocabulary continuous speech recognition (LVCSR) is recognizing an open vocabulary. In this paper, we investigate the use of mixed type of sub-word units in the same recognition lexicon. Namely, m...
متن کاملMorpheme Level Feature-based Language Models for German LVCSR
One of the challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) of German is its complex morphology and high level of compounding. It leads to high Out-of-vocabulary (OOV) rates, and poor Language Model (LM) probabilities. In such cases, building LMs on morpheme level can be considered a better choice. Thereby, higher lexical coverage and lower LM perplexities are achieved. On ...
متن کاملHybrid Language Models Using Mixed Types of Sub-Lexical Units for Open Vocabulary German LVCSR
German is a highly inflected language with a large number of words derived from the same root. It makes use of a high degree of word compounding leading to high Out-of-vocabulary (OOV) rates, and Language Model (LM) perplexities. For such languages the use of sub-lexical units for Large Vocabulary Continuous Speech Recognition (LVCSR) becomes a natural choice. In this paper, we investigate the ...
متن کاملThe RWTH Aachen German and English LVCSR systems for IWSLT-2013
In this paper, German and English large vocabulary continuous speech recognition (LVCSR) systems developed by the RWTH Aachen University for the IWSLT-2013 evaluation campaign are presented. Good improvements are obtained with state-of-the-art monolingual and multilingual bottleneck features. In addition, an open vocabulary approach using morphemic sub-lexical units is investigated along with t...
متن کامل